10.4 Dataset Configuration
All configuration of a Quevedo dataset lives in its configuration file,
config.toml, located at the root of the dataset. This file is in TOML
format, which makes it suitable for both human and machine editing. We
recommend reading the TOML documentation to fully understand the format, but
it is intuitive enough that you can usually modify the configuration file
just by reading it.
As a convenience, Quevedo provides the quevedo config command to edit the
configuration file, but this only launches the user’s configured text
editor with the config.toml file open.
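Because the file is plain TOML, it can also be read from a script. The following is a minimal sketch, not part of Quevedo itself: the option names are taken from this section, but every value in the example string is invented for illustration.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Hypothetical excerpt of a config.toml. The option names appear in this
# section; the values are assumptions made up for this example.
EXAMPLE_CONFIG = """
g_tags = ["shape", "rotation"]
l_tags = ["meaning"]
e_tags = ["relation"]
meta_tags = ["source", "notes"]

folds = [0, 1, 2, 3, 4]
train_folds = [0, 1, 2, 3]
test_folds = [4]
"""

# Parse the TOML text into ordinary Python dicts and lists.
config = tomllib.loads(EXAMPLE_CONFIG)

print(config["g_tags"])
print(config["test_folds"])
```

To read a real dataset instead of a string, open its config.toml in binary mode and pass the file object to tomllib.load.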
10.4.1 Local configuration
Quevedo datasets are meant to be shared, and configuration is an
essential part of the dataset. However, some options may be applicable
only for the local environment, and others may be sensitive and best
not distributed. For this, Quevedo also reads a config.local.toml file if
present. The options in the local configuration file are merged with those
in the main file, overriding them when there is a conflict.
This can be useful for the configuration of the darknet installation, which is likely to differ between environments, and for the web interface, whose configuration may contain sensitive information such as secrets and users’ passwords (even if hashed).
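The override behaviour described above can be pictured with a small sketch. This is not Quevedo's actual code: the function name merge_config, the option values, and the choice to merge nested tables recursively are all assumptions made for illustration.

```python
def merge_config(main, local):
    """Return a new dict in which values from `local` override `main`.

    Nested tables (dicts) are merged key by key. That recursion is an
    assumption of this sketch, not documented Quevedo behaviour.
    """
    merged = dict(main)
    for key, value in local.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of config.toml (shared with the dataset).
main = {
    "g_tags": ["shape"],
    "darknet": {"path": "darknet/darknet", "options": ["-dont_show"]},
}

# Hypothetical contents of config.local.toml (kept on one machine only).
local = {"darknet": {"path": "/opt/darknet/darknet"}}

merged = merge_config(main, local)
print(merged["darknet"]["path"])
```

In this sketch the local path replaces the shared one, while options that appear only in the main file are kept unchanged.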
10.4.2 Annotation schema
The annotation schema of a dataset is a complex set of information and decisions, but to Quevedo, the important information is the features that graphemes, logograms and edges can have. These are lists of strings, and each string represents a possible feature for an annotation object. Each concrete object, then, has a particular value (another string) for each of the appropriate features in its schema. There are four schemas:
g_tags: Possible features for each grapheme, either bound or isolated.
l_tags: Features for each of the logograms.
e_tags: Features for the edges of the logogram graph.
meta_tags: Additional information that can be stored for isolated annotation files, either graphemes or logograms, but not for bound graphemes.
Tags are represented as dictionary objects both in the annotation
files (in JSON format) and in the code (Python dicts). In the annotation
file, apart from their own tags, logograms have a list of the graphemes
found within them. These bound graphemes have their own tags from the
g_tags schema, and an additional piece of information: the coordinates
where they can be found within the logogram image (a.k.a. bounding boxes).
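As a rough illustration of the structure just described, a logogram annotation might look like the dictionary below. The key names ("tags", "graphemes", "box"), the box layout, and all feature names and values are assumptions for this sketch; the real annotation files may use a different layout.

```python
import json

# Hypothetical schemas, as they might be listed in config.toml.
g_tags = ["shape", "rotation"]
l_tags = ["meaning"]

# Hypothetical logogram annotation: its own tags plus the bound graphemes,
# each with tags from the g_tags schema and a bounding box.
logogram = {
    "tags": {"meaning": "house"},
    "graphemes": [
        {"tags": {"shape": "circle", "rotation": "0"},
         "box": [0.5, 0.4, 0.2, 0.2]},
        {"tags": {"shape": "arrow", "rotation": "90"},
         "box": [0.3, 0.7, 0.1, 0.3]},
    ],
}


def unknown_features(tags, schema):
    """Return the feature names in `tags` that the schema does not list."""
    return sorted(set(tags) - set(schema))


# Every tag dictionary should only use features declared in its schema.
print(unknown_features(logogram["tags"], l_tags))
for grapheme in logogram["graphemes"]:
    print(unknown_features(grapheme["tags"], g_tags))

# The same dictionary serializes directly to JSON.
print(json.dumps(logogram, indent=2))
```

The helper unknown_features is not a Quevedo function; it only shows how each tag dictionary relates to the list of features in its schema.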
Note that in versions of Quevedo before v1.1, tags were stored as a
list instead of a dictionary. Before v1.3, there was no logogram
annotation schema. If your dataset is using an old structure, Quevedo
will warn you. Please run the migrate command to upgrade the dataset.
10.4.3 Other options
darknet: Configuration for using the darknet binary and library. See “Darknet installation” (Section 10.2.1.1).
network: Configuration for training and using neural networks. See “Network configuration” (Section 10.2.2).
pipeline: Configuration for pipelines which use many networks to solve a task. See “Pipeline configuration” (Section 10.3.1).
web: Configuration for the web interface. See “Web interface configuration” (Section 10.5.1).
generate: These options guide the process of artificial logogram generation used for data augmentation. See generate.
folds, train_folds, test_folds: The folds option sets the default folds that the split command will use to partition annotations. The train_folds option is a list of fold values that will be used for training, and the test_folds option is the corresponding list for testing. See “Splits and folds” (Section 10.6.6) for more.
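The relationship between the three fold options can be sketched as follows. This is only an illustration of the idea, not Quevedo's implementation: the "fold" key on each annotation, the helper name, and the fold values are assumptions.

```python
def partition_by_fold(annotations, train_folds, test_folds):
    """Split annotations into train and test lists by their fold value."""
    train = [a for a in annotations if a["fold"] in train_folds]
    test = [a for a in annotations if a["fold"] in test_folds]
    return train, test


# Hypothetical option values, as they might appear in config.toml.
folds = [0, 1, 2, 3, 4]
train_folds = [0, 1, 2, 3]
test_folds = [4]

# Hypothetical annotations, each already assigned one of the folds.
annotations = [{"id": i, "fold": folds[i % len(folds)]} for i in range(10)]

train, test = partition_by_fold(annotations, train_folds, test_folds)
print(len(train), len(test))
```

Keeping the fold assignment separate from the train and test lists means the same partition can be reused with different combinations of folds, for example for cross-validation.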
10.4.4 Default configuration
When creating a dataset, Quevedo places a default configuration file with comments to ease personalization. The default file is included here for reference: